Abstract

Bagging and boosting, two effective machine learn- 
ing techniques, are applied to natural language pars- 
ing. Experiments using these techniques with a 
trainable statistical parser are described. The best 
resulting system provides roughly as large a gain
in F-measure as doubling the corpus size. Error 
analysis of the result of the boosting technique re- 
veals some inconsistent annotations in the Penn 
Treebank, suggesting a semi-automatic method for 
finding inconsistent treebank annotations. 

1 Introduction 



Henderson and Brill (1999) showed that independent 
human research efforts produce parsers that can be 
combined for an overall boost in accuracy. Finding 
an ensemble of parsers designed to complement each 
other is clearly desirable. The parsers would need 
to be the result of a unified research effort, though, 
in which the errors made by one parser are targeted 
with priority by the developer of another parser. 

A set of five parsers which each achieve only 40% 
exact sentence accuracy would be extremely valu- 
able if they made errors in such a way that at least 
two of the five were correct on any given sentence 
(and the others abstained or were wrong in different 
ways). 100% sentence accuracy could be achieved 
by selecting the hypothesis that was proposed by 
the two parsers that agreed completely. 

In this paper, the task of automatically creating 
complementary parsers is separated from the task of 
creating a single parser. This facilitates study of the 
ensemble creation techniques in isolation. The result 
is a method for increasing parsing performance by 
creating an ensemble of parsers, each produced from 
data using the same parser induction algorithm. 

2 Bagging and Parsing 

2.1 Background 

The work of Efron and Tibshirani (1993) enabled
Breiman's refinement and application of their
techniques for machine learning (Breiman, 1996). His
technique is called bagging, short for "bootstrap
aggregating". In brief, bootstrap techniques, and
bagging in particular, reduce the systematic biases many
estimation techniques introduce by aggregating
estimates made from randomly drawn representative
resamplings of the original dataset.

Bagging attempts to find a set of classifiers which 
are consistent with the training data, different from 
each other, and distributed such that the aggregate 
sample distribution approaches the distribution of 
samples in the training set. 

Algorithm: Bagging Predictors

(Breiman, 1996) (1)

Given: training set $\mathcal{L} = \{(y_i, x_i), i \in \{1 \ldots m\}\}$
drawn from the set $\Lambda$ of possible training sets, where
$y_i$ is the label for example $x_i$, classification induction
algorithm $\Psi : \Lambda \to \Phi$ with classification algorithm
$\phi \in \Phi$ and $\phi : X \to Y$.

1. Create $k$ bootstrap replicates of $\mathcal{L}$ by sampling
$m$ items from $\mathcal{L}$ with replacement. Call them
$\mathcal{L}_1 \ldots \mathcal{L}_k$.

2. For each $j \in \{1 \ldots k\}$, let $\phi_j = \Psi(\mathcal{L}_j)$ be the
classifier induced using $\mathcal{L}_j$ as the training set.

3. If $Y$ is a discrete set, then for each $x_i$ observed
in the test set, $y_i = \mathrm{mode}(\phi_1(x_i) \ldots \phi_k(x_i))$. $y_i$
is the value predicted by the most predictors,
the majority vote.
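
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the setup used in the experiments: the `induce` argument stands in for the induction algorithm $\Psi$, and `majority_label_learner` is a hypothetical toy learner included only so that the sketch can be run.

```python
import random
from collections import Counter

def bootstrap_replicate(training_set, rng):
    """Step 1: sample m items from the training set with replacement."""
    m = len(training_set)
    return [rng.choice(training_set) for _ in range(m)]

def bag(training_set, induce, k, seed=0):
    """Steps 1 and 2: induce one classifier per bootstrap replicate."""
    rng = random.Random(seed)
    return [induce(bootstrap_replicate(training_set, rng)) for _ in range(k)]

def vote(classifiers, x):
    """Step 3: return the label predicted by the most classifiers.

    Ties are broken in favor of the label that was counted first.
    """
    counts = Counter(phi(x) for phi in classifiers)
    return counts.most_common(1)[0][0]

def majority_label_learner(replicate):
    """Hypothetical toy learner: always predicts the replicate's most common label.

    Each training item is a (label, example) pair, as in the algorithm above.
    """
    label = Counter(y for y, _ in replicate).most_common(1)[0][0]
    return lambda x: label
```

A call such as `vote(bag(data, majority_label_learner, k=15), x)` would then return the majority prediction of a 15-member ensemble for `x`.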

2.2 Bagging for Parsing 

An algorithm that applies the technique of bagging 
to parsing is given in Algorithm 2. Previous work on
combining independent parsers is leveraged to pro- 
duce the combined parser. The rest of the algorithm 
is a straightforward transformation of bagging for 
classifiers. Exploratory work in this vein was de- 
scribed by Hajic et al. (199S). 



Algorithm: Bagging A Parser (2)

Given: A corpus (again as a function) $C : S \times T \to N$,
$S$ is the set of possible sentences, and $T$ is the set
of trees, with size $m = |C| = \sum_{s,t} C(s,t)$ and parser
induction algorithm $g$.

1. Draw $k$ bootstrap replicates $C_1 \ldots C_k$ of $C$, each
containing $m$ samples of $(s,t)$ pairs randomly
picked from the domain of $C$ according to the
distribution $D(s,t) = C(s,t)/|C|$. Each bootstrap
replicate is a bag of samples, where each
sample in a bag is drawn randomly with replacement
from the bag corresponding to $C$.

2. Create parser $f_i = g(C_i)$ for each $i$.

3. Given a novel sentence $s_{test} \in C_{test}$, combine
the collection of hypotheses $t_i = f_i(s_{test})$ using
the unweighted constituent voting scheme
of Henderson and Brill (1999).
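
Step 3 can be illustrated with a short sketch. It assumes that unweighted constituent voting keeps a constituent when a strict majority of the parsers propose it, and that each hypothesis is encoded as a set of `(label, start, end)` tuples; both the encoding and the function name are assumptions made for illustration.

```python
from collections import Counter

def constituent_vote(hypotheses):
    """Keep the constituents proposed by a strict majority of the parsers.

    Each hypothesis is a set of (label, start, end) tuples, an assumed
    encoding of the constituents in one parser's output tree.
    """
    k = len(hypotheses)
    # Count each constituent at most once per parser.
    counts = Counter(c for tree in hypotheses for c in set(tree))
    return {c for c, n in counts.items() if n > k / 2}
```

With three parsers, a constituent survives when at least two of them propose it; with two parsers, both must agree.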



2.3 Experiment 

The training set for these experiments was sections
01-21 of the Penn Treebank (Marcus et al., 1993).
The test set was section 23. The parser induction
algorithm used in all of the experiments in this paper
was a distribution of Collins's model 2 parser
(Collins, 1997). All comparisons made below refer
to results we obtained using Collins's parser.

The results for bagging are shown in the accompanying
figure and in Table 1. The panels of the figure show
(from left to right) training set F-measure (see
footnote 1), test set F-measure, percent perfectly
parsed sentences in the training set, and percent
perfectly parsed sentences in the test set. An
ensemble of bags was produced one bag at a time. In
the table, the Initial row shows the performance
achieved when the ensemble contained only one bag,
Final(X) shows the performance when the ensemble
contained X bags, and BestF gives the performance
of the ensemble size that gave the best F-measure
score. TrainBestF and TestBestF give the test set
performance for the ensemble size that performed
the best on the training and test sets, respectively.

On the training set all of the accuracy measures 
are improved over the original parser, and on the 
test set there is clear improvement in precision and 
recall. The improvement on exact sentence accuracy 
for the test set is significant, but only marginally so. 

The overall gain achieved on the test set by bag- 
ging was 0.8 units of F-measure, but because the 
entire corpus is not used in each bag the initial per- 
formance is approximately 0.2 units below the best 
previously reported result. The net gain using this 
technique is 0.6 units of F-measure. 

3 Boosting

3.1 Background 

The AdaBoost algorithm was presented by Freund
and Schapire in 1996 (Freund and Schapire, 1996;
Freund and Schapire, 1997) and has become a
widely known, successful method in machine learning.
The AdaBoost algorithm imposes one constraint
on its underlying learner: it may abstain from
making predictions about labels of some samples,
but it must consistently be able to get more than
50% accuracy on the samples for which it commits
to a decision. That accuracy is measured according
to the distribution describing the importance of
samples that it is given. The learner must be able
to get more correct samples than incorrect samples
by mass of importance on those that it labels. This
statement of the restriction comes from Schapire and
Singer's study (1998). It is called the weak learning
criterion.
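
The criterion can be stated as a small check. The sketch below is illustrative only: it assumes abstentions are encoded as `None`, and the function names are hypothetical.

```python
def weighted_margin(weights, labels, predictions):
    """Return (correct mass, incorrect mass) over the samples the learner labels.

    predictions[i] is None when the learner abstains on sample i.
    """
    correct = incorrect = 0.0
    for w, y, p in zip(weights, labels, predictions):
        if p is None:
            continue  # abstentions count neither for nor against the learner
        if p == y:
            correct += w
        else:
            incorrect += w
    return correct, incorrect

def satisfies_weak_learning(weights, labels, predictions):
    """True when more importance mass is labeled correctly than incorrectly."""
    correct, incorrect = weighted_margin(weights, labels, predictions)
    return correct > incorrect
```

Because the comparison uses importance mass rather than raw counts, a learner can satisfy the criterion while mislabeling most samples, provided the few it gets right carry enough weight.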



1 This is the balanced version of F-measure, where precision 
and recall are weighted equally. 



Schapire and Singer (1998) extended AdaBoost by
describing how to choose the hypothesis mixing
coefficients in certain circumstances and how to
incorporate a general notion of confidence scores. They
also provided a better characterization of its
theoretical performance. The version of AdaBoost used
in their work is shown in Algorithm 3, as it is the
version that is most amenable to parsing.

Algorithm: AdaBoost

(Freund and Schapire, 1997) (3)

Given: Training set $\mathcal{L}$ as in bagging, except $y_i \in \{-1, 1\}$
is the label for example $x_i$. Initial uniform
distribution $D_1(i) = 1/m$. Number of iterations, $T$.
Counter $t = 1$. $\Psi$, $\Phi$, and $\phi$ are as in Bagging.

1. Create $\mathcal{L}_t$ by randomly choosing with replacement
$m$ samples from $\mathcal{L}$ using distribution $D_t$.

2. Classifier induction: $\phi_t \leftarrow \Psi(\mathcal{L}_t)$

3. Choose $\alpha_t \in \mathbb{R}$.

4. Adjust and normalize the distribution. $Z_t$ is a
normalization coefficient.

$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp(-\alpha_t y_i \phi_t(x_i))$$

5. Increment $t$. Quit if $t > T$.

6. Repeat from step 1.

7. The final hypothesis is

$$\phi_{boost}(x) = \mathrm{sign} \sum_t \alpha_t \phi_t(x)$$

The value of $\alpha_t$ should generally be chosen to minimize

$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i \phi_t(x_i))$$

in order to minimize the expected per-sample training
error of the ensemble, which Schapire and Singer
show is bounded above by $\prod_t Z_t$. They also
give several examples of how to pick an appropriate
$\alpha$; the selection generally depends on the possible
outputs of the underlying learner.
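
As one concrete case, when the learner always outputs a label in {-1, +1} and never abstains, the minimizing value has the well-known closed form $\alpha_t = \frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$, where $\epsilon_t$ is the weighted error. The sketch below illustrates the update under that assumption only; the function names are hypothetical and it is not the code used in the experiments.

```python
import math

def choose_alpha(dist, labels, preds):
    """Closed-form alpha for predictions in {-1, +1} with no abstentions.

    Requires a weighted error strictly between 0 and 1.
    """
    err = sum(d for d, y, p in zip(dist, labels, preds) if y != p)
    return 0.5 * math.log((1.0 - err) / err)

def update_distribution(dist, labels, preds, alpha):
    """Step 4: reweight by exp(-alpha * y * phi(x)), then normalize by Z."""
    unnorm = [d * math.exp(-alpha * y * p)
              for d, y, p in zip(dist, labels, preds)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def boosted_sign(alphas, classifiers, x):
    """Step 7: sign of the alpha-weighted vote (a total of zero maps to +1)."""
    total = sum(a * phi(x) for a, phi in zip(alphas, classifiers))
    return 1 if total >= 0 else -1
```

With this choice of $\alpha_t$, the misclassified samples hold exactly half of the mass in the updated distribution, so the next classifier is pushed toward the samples the current one got wrong.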

Boosting has been used in a few NLP systems.
Haruno et al. (199g| ) used boosting to produce more 
accurate classifiers which were embedded as control
mechanisms of a parser for Japanese. The creators
of AdaBoost used it to perform text classification
(Schapire and Singer, 2000). Abney et al. (1999)
performed part-of-speech tagging and prepositional
phrase attachment using AdaBoost as a core component.
They found they could achieve accuracies on
both tasks that were competitive with the state of
the art. As a side effect, they found that inspecting
the samples that were consistently given the most
weight during boosting revealed some faulty
annotations in the corpus. In all of these systems,
AdaBoost has been used as a traditional classification
system.

Set       Instance         P      R      F      F Gain  Exact  Exact Gain
Training  Original Parser  96.25  96.31  96.28  NA      64.7   NA
          Initial          93.61  93.63  93.62  0.00    55.5   0.0
          BestF(15)        96.16  95.86  96.01  2.39    62.1   6.6
          Final(15)        96.16  95.86  96.01  2.39    62.1   6.6
Test      Original Parser  88.73  88.54  88.63  NA      34.9   NA
          Initial          88.43  88.34  88.38  0.00    33.3   0.0
          TrainBestF(15)   89.54  88.80  89.17  0.79    34.6   1.3
          TestBestF(13)    89.55  88.84  89.19  0.81    34.7   1.4
          Final(15)        89.54  88.80  89.17  0.79    34.6   1.3

Table 1: Bagging the Treebank

3.2 Boosting for Parsing 

Our goal is to recast boosting for parsing while
considering a parsing system as the embedded learner.
The formulation is given in Algorithm 4. The
intuition behind the additive form is that the weight
placed on a sentence should be the sum of the weights
we would like to place on its constituents. The
weight on constituents that are predicted incorrectly
is adjusted by a factor of 1, in contrast to a factor
of $\alpha$ for those that are predicted correctly.

Algorithm: Boosting A Parser (4)

Given: corpus $C$ with size $m = |C| = \sum_{s,t} C(s,t)$
and parser induction algorithm $g$. Initial uniform
distribution $D_1(i) = 1/m$. Number of iterations, $T$.
Counter $t = 1$.

1. Create $C_t$ by randomly choosing with replacement
$m$ samples from $C$ using distribution $D_t$.

2. Create parser $f_t \leftarrow g(C_t)$.

3. Choose $\alpha_t \in \mathbb{R}$ (described below).

4. Adjust and normalize the distribution. $Z_t$ is
a normalization coefficient. For all $i$, let parse
tree $\tau'_i \leftarrow f_t(s_i)$, with $\tau_i$ the reference
tree for $s_i$. Let $\delta(\tau, c)$ be a function
indicating that $c$ is in parse tree $\tau$, and $|\tau|$ is the
number of constituents in tree $\tau$. $T(s)$ is the set
of constituents that are found in the reference
or hypothesized annotation for $s$.

$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \sum_{c \in T(s_i)} \left( \alpha + (1 - \alpha)\,|\delta(\tau'_i, c) - \delta(\tau_i, c)| \right)$$

5. Increment $t$. Quit if $t > T$.

6. Repeat from step 1.

7. The final hypothesis is computed by combining
the individual constituents. Each parser $f_t$
in the ensemble gets a vote with weight $\alpha_t$ for
the constituents it predicts. Precisely those
constituents with weight strictly larger than
$\frac{1}{2} \sum_t \alpha_t$ are put into the final hypothesis.
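
Steps 4 and 7 can be sketched as follows. The sketch reflects one reading of the algorithm: each constituent in the union of the reference and hypothesized trees contributes a factor of alpha when the parser got it right and a factor of 1 when it did not, and a sentence's new weight is its old weight times the sum of those factors. The set-based tree encoding and the function names are assumptions made for illustration.

```python
def boost_parser_weights(dist, references, hypotheses, alpha):
    """Step 4: additive per-constituent reweighting, then normalization.

    references[i] and hypotheses[i] are sets of constituents for sentence i.
    Assumes at least one sentence has a non-empty reference or hypothesis.
    """
    unnorm = []
    for d, ref, hyp in zip(dist, references, hypotheses):
        right = len(ref & hyp)  # in both trees: factor alpha each
        wrong = len(ref ^ hyp)  # in exactly one tree: factor 1 each
        unnorm.append(d * (alpha * right + wrong))
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_constituent_vote(alphas, hypotheses):
    """Step 7: keep constituents whose vote weight exceeds half the total."""
    threshold = 0.5 * sum(alphas)
    weight = {}
    for a, tree in zip(alphas, hypotheses):
        for c in tree:
            weight[c] = weight.get(c, 0.0) + a
    return {c for c, w in weight.items() if w > threshold}
```

Under this reading, a sentence parsed perfectly has its weight scaled by alpha times its constituent count, while every precision or recall error adds a full unit, so mass shifts toward sentences with many errors whenever alpha is below 1.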

A potential constituent can be considered correct 
if it is predicted in the hypothesis and it exists in 
the reference, or it is not predicted and it is not in 
the reference. Potential constituents that do not ap- 
pear in the hypothesis or the reference should not 
make a big contribution to the accuracy computa- 
tion. There are many such potential constituents, 
and if we were maximizing a function that treated 
getting them incorrect the same as getting a con- 
stituent that appears in the reference correct, we 
would most likely decide not to predict any con- 
stituents. 

Our model of constituent accuracy is thus sim- 
ple. Each prediction correctly made over T(s) will be 
given equal weight. That is, correctly hypothesizing 
a constituent in the reference will give us one point, 
but a precision or recall error will cause us to miss 
one point. Constituent accuracy is then a/(a + b + c),
where a is the number of constituents correctly hy- 
pothesized, b is the number of precision errors and c 
is the number of recall errors. 
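
For a single sentence, this measure can be computed directly from the two sets of constituents. The sketch below assumes each tree is represented as a set of hashable constituents, and it assumes a score of 1.0 when both sets are empty; the source does not address that case.

```python
def constituent_accuracy(reference, hypothesis):
    """Return a / (a + b + c) for one sentence.

    a: constituents in both the reference and the hypothesis
    b: precision errors (hypothesized but not in the reference)
    c: recall errors (in the reference but not hypothesized)
    """
    a = len(reference & hypothesis)
    b = len(hypothesis - reference)
    c = len(reference - hypothesis)
    total = a + b + c
    # Assumed convention: an empty reference and hypothesis count as perfect.
    return a / total if total else 1.0
```

Potential constituents that appear in neither tree never enter the computation, which is the property the preceding paragraphs argue for.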

The equation below shows a computation of $\alpha_{ca}$ as
described.



$$\sum_i D(i) \sum_{c \in T(s_i)} \left( \delta(\tau_i, c) + \delta(\tau'_i, c) - 2\,\delta(\tau_i, c)\,\delta(\tau'_i, c) \right)$$



Boosting algorithms were developed that attempted
to maximize F-measure, precision, and recall
by varying the computation of $\alpha$, giving results
too numerous to include here. The algorithm given 
here performed the best of the lot, but was only 
marginally better for some metrics. 



Set       Instance         P      R      F      F Gain  Exact  Exact Gain
Training  Original Parser  96.25  96.31  96.28  NA      64.7   NA
          Initial          93.54  93.61  93.58  0.00    54.8   0.0
          BestF(15)        96.21  95.79  96.00  2.42    57.3   2.5
          Final(15)        96.21  95.79  96.00  2.42    57.3   2.5
Test      Original Parser  88.73  88.54  88.63  NA      34.9   NA
          Initial          88.05  88.09  88.07  0.00    33.3   0.0
          TrainBestF(15)   89.37  88.32  88.84  0.77    33.0   -0.3
          TestBestF(14)    89.39  88.41  88.90  0.83    33.4   0.1
          Final(15)        89.37  88.32  88.84  0.77    33.0   -0.3

Table 2: Boosting the Treebank



3.3 Experiment 

The experimental results for boosting are shown in the
accompanying figure and in Table 2. There is a large plateau in
performance from iterations 5 through 12. Because 
of their low accuracy and high degree of specializa- 
tion, the parsers produced in these iterations had 
little weight during voting and had little effect on 
the cumulative decision making. 

As in the bagging experiment, it appears that 
there would be more precision and recall gain to 
be had by creating a larger ensemble. In both the 
bagging and boosting experiments time and resource 
constraints dictated our ensemble size. 

In the table we see that the boosting algorithm 
equaled bagging's test set gains in precision and
recall. The Initial performance for boosting was
lower, though. We cannot confirm the cause, but suspect
it is due to unfortunate resampling of the data
during the first iteration of boosting. Exact sentence
accuracy, though, was not significantly improved on 
the test set. 

Overall, we prefer bagging to boosting for this 
problem when raw performance is the goal. There 
are side effects of boosting that are useful in other 
respects, though, which we explore in Section 4.2. 



3.3.1 Weak Learning Criterion Violations 

While investigating the failures of the boosting
algorithm, we hypothesized that the parser induction
system did not satisfy the weak learning criterion.
The distribution of boosting weights was more skewed
in later iterations. Inspection of the sentences that
were getting much mass placed upon them revealed that
their weight was being boosted in every iteration.
The hypothesis was that the parser was simply unable
to learn them.

39832 parsers were built to test this, one for each 
sentence in the training set. Each of these parsers 
was trained on only a single sentence (see footnote 2) and evaluated
on the same sentence. It was discovered that a full 
4764 (11.2%) of these sentences could not be parsed 
completely correctly by the parsing system. 



2 The sentence was replicated 10 times to avoid threshold- 
ing effects in the learner. 



3.3.2 Corpus Trimming 

In order to evaluate how well boosting worked with 
a learner that better satisfied the weak learning cri- 
terion, the boosting experiment was run again on 
the Treebank minus the troublesome sentences described
above. The results are in Table 3. This
dataset produces a larger gain in comparison to the 
results using the entire Treebank. The initial ac- 
curacy, however, is lower. We hypothesize that the 
boosting algorithm did perform better here, but the 
parser induction system was learning useful informa- 
tion in those sentences that it could not memorize 
(e.g. lexical information) that was successfully ap- 
plied to the test set. 

In this manner we managed to clean our dataset to 
the point that the parser could learn each sentence 
in isolation. The corpus-makers cannot necessarily 
be blamed for the sentences that could not be mem- 
orized. All that can be said about those sentences 
is that for better or worse, the parser's model would 
not accommodate them. 

4 Corpus Analysis 

4.1 Noisy Corpus: Empirical Investigation 

To acquire experimental evidence of noisy data, we
inspected the distributions that were used while
boosting the stable corpus. The distribution was
expected to be skewed if there was noise in the data,
or uniform with slight fluctuations if the parser fit
the data well.

Figure 1 shows how the boosting weight distribution
changes. Each curve corresponds to one boosting
iteration, as indexed in the key of the figure.
This training run used a corpus of 5000 sentences.
The sentences are ranked by the weight they are
given in the distribution and arranged in decreasing
order of weight along the x-axis. The distribution was
smoothed by putting samples into equal weight bins
and reporting the average mass of the samples in each
bin as the y-coordinate. We used 1000 bins
for this graph, and a log scale on the x-axis. Since
there were 5000 samples, all samples initially had a
y-value of 0.0002.
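
The smoothing step can be sketched as below. The sketch assumes that the bins hold equal numbers of rank-ordered samples (five per bin for 5000 samples and 1000 bins), which is one reading of the description above; the function name is hypothetical.

```python
def binned_weight_curve(weights, num_bins):
    """Rank weights in decreasing order and average them within each bin.

    Bins are assumed to contain (nearly) equal numbers of samples.
    Returns one average mass per non-empty bin, heaviest bin first.
    """
    ranked = sorted(weights, reverse=True)
    n = len(ranked)
    curve = []
    for b in range(num_bins):
        lo = b * n // num_bins
        hi = (b + 1) * n // num_bins
        chunk = ranked[lo:hi]
        if chunk:
            curve.append(sum(chunk) / len(chunk))
    return curve
```

For the initial uniform distribution over 5000 sentences, every bin average equals 1/5000 = 0.0002, matching the starting y-value quoted above.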



Set       Instance         P      R      F      F Gain  Exact  Exact Gain
Training  Original Parser  96.25  96.31  96.28  NA      64.7   NA
          Initial          94.60  94.68  94.64  0.00    62.2   0.0
          BestF(8)         97.38  97.00  97.19  2.55    63.1   0.9
          Final(15)        97.00  96.17  96.58  1.94    55.0   -7.2
Test      Original Parser  88.73  88.54  88.63  NA      34.9   NA
          Initial          87.43  87.21  87.32  0.00    32.6   0.0
          TrainBestF(8)    89.12  87.62  88.36  1.04    32.8   0.2
          TestBestF(6)     89.07  87.77  88.42  1.10    32.9   0.4
          Final(15)        89.18  87.19  88.18  0.86    31.7   -0.8

Table 3: Boosting the Stable Corpus




Figure 1: Weight Change During Boosting 



Notice first that the left endpoints of the lines 
move from bottom to top in order of boosting it- 
eration. The distribution becomes monotonically 
more skewed as boosting progresses. Secondly we 
see by the last iteration that most of the weight is 
concentrated on fewer than 100 samples. This graph
shows behavior consistent with noise in the corpus 
on which the boosting algorithm is focusing. 

4.2 Treebank Inconsistencies

There are sentences in the corpus that can be learned
by the parser induction algorithm in isolation but
not in concert, because they contain conflicting
information. Finding these sentences leads to a better
understanding of the quality of our corpus, and gives
an idea of where improvements in annotation quality
can be made. Abney et al. (1999) showed a
similar corpus analysis technique for part-of-speech
tagging and prepositional phrase attachment, but for
parsing we must remove errors introduced by the
parser, as we did in Section 3.3.2, before questioning
the corpus quality. A particular class of errors,
inconsistencies, can then be investigated. Inconsistent
annotations are those that appear plausible in
isolation, but which conflict with annotation decisions
made elsewhere in the corpus.

In Figure 5 we show a set of trees selected from
within the top 100 most heavily weighted trees at
the end of 15 iterations of boosting the stable cor- 
pus. Collins's parser induction system is able to learn 
to produce any one of these structures in isolation, 
but the presence of conflicting information in differ- 
ent sentences prevents it from achieving 100% accu- 
racy on the set. 

5 Training Corpus Size Effects 

We suspect our best parser diversification techniques
give a performance gain approximately equal to that of
doubling the size of the training set. While this cannot
be directly tested without hiring more annotators, 
an expected performance bound for a larger train- 
ing set can be produced by extrapolating from how 
well the parser performs using smaller training sets. 
There are two characteristics of training curves for 
large corpora that can provide such a bound: train- 
ing curves generally increase monotonically in the 
absence of over-training, and their first derivatives 
generally decrease monotonically. 



Set       Sentences  P      R      F      Exact
Training  50         67.57  32.15  43.57  5.4
          100        69.03  56.23  61.98  8.5
          500        78.12  75.46  76.77  18.2
          1000       81.36  80.70  81.03  22.9
          5000       87.28  87.09  87.19  34.1
          10000      89.74  89.56  89.65  41.0
          20000      92.42  92.40  92.41  50.3
          39832      96.25  96.31  96.28  64.7
Test      50         68.13  32.24  43.76  4.7
          100        69.90  54.19  61.05  7.8
          500        78.72  75.33  76.99  19.1
          1000       81.61  80.68  81.14  22.2
          5000       86.03  85.43  85.73  28.6
          10000      87.29  86.81  87.05  30.8
          20000      87.99  87.87  87.93  32.7
          39832      88.73  88.54  88.63  34.9

Table 4: Effects of Varying Training Corpus Size



The training curves we present in Figure 4 and
Table 4 suggest that roughly doubling the corpus size
in the range of interest (between 10000 and 40000
sentences) gives a test set F-measure gain of
approximately 0.70.
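
The doubling gains can be read off the test set F-measure column of Table 4. The short sketch below does this arithmetic; the dictionary copies the table's values, while the function name and the 5% tolerance used to treat 39832 as double 20000 are assumptions made for illustration.

```python
# Test set F-measure by number of training sentences (Table 4).
TEST_F = {
    50: 43.76, 100: 61.05, 500: 76.99, 1000: 81.14,
    5000: 85.73, 10000: 87.05, 20000: 87.93, 39832: 88.63,
}

def doubling_gains(curve, tolerance=0.05):
    """Return the F-measure gain for each pair of sizes that differ by about 2x."""
    sizes = sorted(curve)
    gains = {}
    for small in sizes:
        for large in sizes:
            if abs(large / small - 2.0) <= tolerance:
                gains[(small, large)] = round(curve[large] - curve[small], 2)
    return gains
```

On these values, the gain from 10000 to 20000 sentences is 0.88 and the gain from 20000 to 39832 sentences is 0.70, so each successive doubling in this range buys less.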

Bagging achieved significant gains of approxi- 
mately 0.60 over the best reported previous F- 
measure without adding any new data. In this re- 
spect, these techniques show promise for making 
performance gains on large corpora without adding 
more data or new parsers. 

6 Conclusion 

We have shown two methods, bagging and boosting, 
for automatically creating ensembles of parsers that 
produce better parses than any individual in the
ensemble. Neither of the algorithms exploits any
specialized knowledge of the underlying parser
induction algorithm, and the data used in creating the
ensembles has been restricted to a single common 
training set to avoid issues of training data quantity 
affecting the outcome. 

Our best bagging system performed consistently 
well on all metrics, including exact sentence accu- 
racy. It resulted in a statistically significant F- 
measure gain of 0.6 over the performance of the base- 
line parser. That baseline system is the best known 
Treebank parser. This gain compares favorably with 
a bound on potential gain from increasing the corpus 
size. 

Even though it is computationally expensive to 
create and evaluate a small (15-30) ensemble of 
parsers, the cost is far outweighed by the opportu- 
nity cost of hiring humans to annotate 40000 more 
sentences. The economic basis for using ensemble 
methods will continue to improve with the increasing 
value (performance per price) of modern hardware. 

Our boosting system, although dominated by the 
bagging system, also performed significantly better 
than the best previously known individual parsing 
result. We have shown how to exploit the distri- 
bution created as a side-effect of the boosting al- 
gorithm to uncover inconsistencies in the training 
corpus. A semi-automated technique for doing this 
as well as examples from the Treebank that are in- 
consistently annotated were presented. Perhaps the 
biggest advantage of this technique is that it requires 
no a priori notion of how the inconsistencies can be 
characterized. 

Figure 4: Effects of Varying Training Corpus Size




Figure 5: An inconsistent set of annotations from the Treebank 



